欢迎关注“小丫画图”公众号,回复“小白”,看小视频,实现点鼠标跑代码。

小丫微信: epigenomics E-mail:

作者:Caelum,他的更多作品看这里https://k.youshop10.com/M4k338pp

小丫编辑校验

需求描述

画出CNV的这个图。

出自https://molecular-cancer.biomedcentral.com/articles/10.1186/s12943-021-01322-w,跟FigureYa259circLink和FigureYa262GDC出自同一篇文章

Fig. 1 Genetic and transcriptional alterations of RNA modification “writers” in CRC. c Bar graphs showing the frequency of CNV gain (red), loss (blue) and non_CNV (green) of RNA modification “writers” in the TCGA-COAD/READ cohort. The height of each bar represents the alteration frequency.

应用场景

外显子测序数据可以画这样的图。

RNA-seq数据也可以借助这样的分析来深挖。例如RNA-seq筛出成百上千个差异基因,谁才是关键基因?它通过什么方式影响了下游基因的表达?或许是某些基因发生了高频copy number variation(gain/loss)。

环境设置

使用国内镜像安装包

options("repos"= c(CRAN="https://mirrors.tuna.tsinghua.edu.cn/CRAN/"))
options(BioC_mirror="http://mirrors.tuna.tsinghua.edu.cn/bioconductor/")

加载包

library(magrittr)
library(tidyverse)
## ── Attaching packages ─────────────────────────────────────── tidyverse 1.3.1 ──
## ✓ ggplot2 3.3.5     ✓ purrr   0.3.4
## ✓ tibble  3.1.6     ✓ dplyr   1.0.7
## ✓ tidyr   1.1.4     ✓ stringr 1.4.0
## ✓ readr   2.1.1     ✓ forcats 0.5.1
## ── Conflicts ────────────────────────────────────────── tidyverse_conflicts() ──
## x tidyr::extract()   masks magrittr::extract()
## x dplyr::filter()    masks stats::filter()
## x dplyr::lag()       masks stats::lag()
## x purrr::set_names() masks magrittr::set_names()
library(org.Hs.eg.db)
## Loading required package: AnnotationDbi
## Loading required package: stats4
## Loading required package: BiocGenerics
## Loading required package: parallel
## 
## Attaching package: 'BiocGenerics'
## The following objects are masked from 'package:parallel':
## 
##     clusterApply, clusterApplyLB, clusterCall, clusterEvalQ,
##     clusterExport, clusterMap, parApply, parCapply, parLapply,
##     parLapplyLB, parRapply, parSapply, parSapplyLB
## The following objects are masked from 'package:dplyr':
## 
##     combine, intersect, setdiff, union
## The following objects are masked from 'package:stats':
## 
##     IQR, mad, sd, var, xtabs
## The following objects are masked from 'package:base':
## 
##     anyDuplicated, append, as.data.frame, basename, cbind, colnames,
##     dirname, do.call, duplicated, eval, evalq, Filter, Find, get, grep,
##     grepl, intersect, is.unsorted, lapply, Map, mapply, match, mget,
##     order, paste, pmax, pmax.int, pmin, pmin.int, Position, rank,
##     rbind, Reduce, rownames, sapply, setdiff, sort, table, tapply,
##     union, unique, unsplit, which, which.max, which.min
## Loading required package: Biobase
## Welcome to Bioconductor
## 
##     Vignettes contain introductory material; view with
##     'browseVignettes()'. To cite Bioconductor, see
##     'citation("Biobase")', and for packages 'citation("pkgname")'.
## Loading required package: IRanges
## Loading required package: S4Vectors
## 
## Attaching package: 'S4Vectors'
## The following objects are masked from 'package:dplyr':
## 
##     first, rename
## The following object is masked from 'package:tidyr':
## 
##     expand
## The following object is masked from 'package:base':
## 
##     expand.grid
## 
## Attaching package: 'IRanges'
## The following objects are masked from 'package:dplyr':
## 
##     collapse, desc, slice
## The following object is masked from 'package:purrr':
## 
##     reduce
## 
## Attaching package: 'AnnotationDbi'
## The following object is masked from 'package:dplyr':
## 
##     select
## 
library(TCGAbiolinks)
library(clusterProfiler)
## 
## clusterProfiler v3.16.1  For help: https://guangchuangyu.github.io/software/clusterProfiler
## 
## If you use clusterProfiler in published research, please cite:
## Guangchuang Yu, Li-Gen Wang, Yanyan Han, Qing-Yu He. clusterProfiler: an R package for comparing biological themes among gene clusters. OMICS: A Journal of Integrative Biology. 2012, 16(5):284-287.
## 
## Attaching package: 'clusterProfiler'
## The following object is masked from 'package:AnnotationDbi':
## 
##     select
## The following object is masked from 'package:IRanges':
## 
##     slice
## The following object is masked from 'package:S4Vectors':
## 
##     rename
## The following object is masked from 'package:purrr':
## 
##     simplify
## The following object is masked from 'package:stats':
## 
##     filter
Sys.setenv(LANGUAGE = "en") #显示英文报错信息
options(stringsAsFactors = FALSE) #禁止chr转成factor

输入数据

从GDC下载hg38的CNV数据,以COAD/READ为例。

easy_input_gene.txt,要看哪些基因的CNV,就写到这个文件里。甚至可以把所有差异基因或参与某个通路的基因都拿来计算,查看输出文件output_cnv.csv。然后筛选gain或loss频率高的前十几个基因,进行画图展示。

cnv_COAD <- GDCquery(project = "TCGA-COAD",
                data.category = "Copy Number Variation",
                data.type = "Gene Level Copy Number Scores") %T>%
  GDCdownload %>% GDCprepare
## --------------------------------------
## o GDCquery: Searching in GDC database
## --------------------------------------
## Genome of reference: hg38
## --------------------------------------------
## oo Accessing GDC. This might take a while...
## --------------------------------------------
## ooo Project: TCGA-COAD
## --------------------
## oo Filtering results
## --------------------
## ooo By data.type
## ----------------
## oo Checking data
## ----------------
## ooo Check if there are duplicated cases
## ooo Check if there results for the query
## -------------------
## o Preparing output
## -------------------
## Downloading data for project TCGA-COAD
## Of the 1 files for download 1 already exist.
## All samples have been already downloaded
## Reading GISTIC file
## Reading file: GDCdata/TCGA-COAD/harmonized/Copy_Number_Variation/Gene_Level_Copy_Number_Scores/7f01e47a-2f4c-4b91-8db6-e1b6b5595390/COAD.focal_score_by_genes.txt
## 
indexing COAD.focal_score_by_genes.txt [=------------------] 34.31GB/s, eta:  0s
indexing COAD.focal_score_by_genes.txt [==================] 208.03MB/s, eta:  0s
                                                                                
dim(cnv_COAD)
## [1] 19729   509
cnv_READ <- GDCquery(project = "TCGA-READ",
                data.category = "Copy Number Variation",
                data.type = "Gene Level Copy Number Scores") %T>%
  GDCdownload %>% GDCprepare
## --------------------------------------
## o GDCquery: Searching in GDC database
## --------------------------------------
## Genome of reference: hg38
## --------------------------------------------
## oo Accessing GDC. This might take a while...
## --------------------------------------------
## ooo Project: TCGA-READ
## --------------------
## oo Filtering results
## --------------------
## ooo By data.type
## ----------------
## oo Checking data
## ----------------
## ooo Check if there are duplicated cases
## ooo Check if there results for the query
## -------------------
## o Preparing output
## -------------------
## Downloading data for project TCGA-READ
## Of the 1 files for download 1 already exist.
## All samples have been already downloaded
## Reading GISTIC file
## Reading file: GDCdata/TCGA-READ/harmonized/Copy_Number_Variation/Gene_Level_Copy_Number_Scores/8dcb2bfd-0e8a-402d-941c-def06f2e8a96/READ.focal_score_by_genes.txt
## 
indexing READ.focal_score_by_genes.txt [=====--------------] 92.45GB/s, eta:  0s
indexing READ.focal_score_by_genes.txt [==================] 282.75MB/s, eta:  0s
                                                                                
dim(cnv_READ)
## [1] 19729   173
# 合并COAD和READ
cnv <- cbind(cnv_COAD, cnv_READ)
dim(cnv)
## [1] 19729   682
# 删除重复的sample
cnv <- cnv[,!duplicated(colnames(cnv))]
dim(cnv)
## [1] 19729   679
# 根据需要定义一个基因集
# 可以从文件读入
genes <- read.table("easy_input_gene.txt", header = T)$SYMBOL
# 也可以直接在这里写入
#genes <- c("METTL3", "METTL14", "FTO", "ALKBH5","IGF2BP3",
#           "YTHDF1", "YTHDF2", "YTHDF3", "IGF2BP1", "IGF2BP2", )

计算gain and loss frequency

cnv2 <- cnv %>%
  # 理解ENSEMBL命名方法,提取前15位作为基因名
  mutate(ENSEMBL = str_sub(`Gene Symbol`, 1, 15)) %>%
  # 转换为基因名SYMBOL
  inner_join(bitr(.$ENSEMBL,
                  fromType = "ENSEMBL",
                  toType = "SYMBOL",
                  OrgDb = org.Hs.eg.db)) %>% 
  
  # 去除转换时可能产生的重复项
  distinct(SYMBOL, .keep_all = TRUE) %>% 
  # 仅保留目的基因
  filter(SYMBOL %in% genes) %>% 
  # 将SYMBOL作为行名,去除非数字列干扰
  column_to_rownames("SYMBOL") %>% 
  # 去除其他干扰信息
  dplyr::select(-`Gene ID`, -Cytoband, -ENSEMBL, -`Gene Symbol`) %>% 
  # 理解TCGA barcode,仅保留肿瘤样本数据
  select_if(str_sub(colnames(.), 14, 15) < 10) %>% 
  
  # +1代表gain,-1代表loss,统计每一行的loss和gain
  mutate(CNV_loss = apply(., 1, function(x) sum(x == -1)/length(x)),
         CNV_gain = apply(., 1, function(x) sum(x == 1)/length(x))) %>% 
  # 仅保留loss和gain数据
  dplyr::select(CNV_loss, CNV_gain) %>% 
  # 重新把行名当做基因名
  rownames_to_column("gene") %>% 
  
  # 统计没有CNV的样本的比例
  mutate(none_CNV = 1 - CNV_loss - CNV_gain) %>% 
  # 将宽数据转变为长数据,便于绘图
  pivot_longer(!gene, names_to = "Group", values_to = "pct") %>% 
  
  # 转换为百分比
  mutate(pct = pct * 100)
## 'select()' returned 1:many mapping between keys and columns
## Warning in bitr(.$ENSEMBL, fromType = "ENSEMBL", toType = "SYMBOL", OrgDb =
## org.Hs.eg.db): 3.48% of input gene IDs are fail to map...
## Joining, by = "ENSEMBL"
# 保存到文件
write.csv(cnv2, "output_cnv.csv", row.names = F, quote = F)

开始画图

# 按照easy_input_gene.txt文件中的顺序画基因
#cnv2$gene <- factor(cnv2$gene, levels = genes)
# 或者按照CNV_gain由大到小的顺序画基因
cnv2_gain <- cnv2[cnv2$Group == "CNV_gain",]
cnv2_gain_sorted <- cnv2_gain[order(cnv2_gain$pct, decreasing = T),] 
cnv2$gene <- factor(cnv2$gene, levels = cnv2_gain_sorted$gene)

# 按照loss、gain、none的顺序画bar
cnv2$Group <- factor(cnv2$Group, levels = c("CNV_loss", "CNV_gain", "none_CNV"))

ggplot(cnv2, aes(x = gene, y = pct, fill = Group)) +
  # 绘制堆积条形图
  geom_col(position = "stack") +
  # 添加标题
  ggtitle(paste0("TCGA-COAD/READ\nn=", ncol(cnv) - 3)) +
  # 手动添加颜色
  scale_fill_manual(values = c("steelblue3", "firebrick2",  "forestgreen")) +
  xlab("m6A genes") +
  ylab("Frequency of group(%)") +
  theme_classic() +
  theme(axis.line = element_line(size = 1, lineend = "square"),
        axis.title = element_text(size = 15,face = "bold"),
        axis.text.y = element_text(size = 12,face = "bold"),
        axis.text.x = element_text(size = 12,face = "bold", angle = 90, hjust = 1),
        axis.ticks.length = unit(0.25,"cm"),
        axis.ticks = element_line(size = 1),
        legend.text = element_text(size = 12,face = "bold"),
        legend.title = element_text(size = 15,face = "bold")) 

ggsave("CNV.pdf", width = 6, height = 4)

我们算出的frequency跟原文差距较大,跟cBioPortal结果相近。跟原文作者联系,还没有回复。

Session Info

sessionInfo()
## R version 4.0.2 (2020-06-22)
## Platform: x86_64-apple-darwin17.0 (64-bit)
## Running under: macOS  10.16
## 
## Matrix products: default
## BLAS:   /Library/Frameworks/R.framework/Versions/4.0/Resources/lib/libRblas.dylib
## LAPACK: /Library/Frameworks/R.framework/Versions/4.0/Resources/lib/libRlapack.dylib
## 
## locale:
## [1] zh_CN.UTF-8/zh_CN.UTF-8/zh_CN.UTF-8/C/zh_CN.UTF-8/zh_CN.UTF-8
## 
## attached base packages:
## [1] parallel  stats4    stats     graphics  grDevices utils     datasets 
## [8] methods   base     
## 
## other attached packages:
##  [1] clusterProfiler_3.16.1 TCGAbiolinks_2.16.4    org.Hs.eg.db_3.11.4   
##  [4] AnnotationDbi_1.50.3   IRanges_2.22.2         S4Vectors_0.26.1      
##  [7] Biobase_2.48.0         BiocGenerics_0.34.0    forcats_0.5.1         
## [10] stringr_1.4.0          dplyr_1.0.7            purrr_0.3.4           
## [13] readr_2.1.1            tidyr_1.1.4            tibble_3.1.6          
## [16] ggplot2_3.3.5          tidyverse_1.3.1        magrittr_2.0.1        
## 
## loaded via a namespace (and not attached):
##   [1] readxl_1.3.1                backports_1.4.1            
##   [3] fastmatch_1.1-3             BiocFileCache_1.12.1       
##   [5] plyr_1.8.6                  igraph_1.2.10              
##   [7] splines_4.0.2               BiocParallel_1.22.0        
##   [9] GenomeInfoDb_1.24.2         urltools_1.7.3             
##  [11] digest_0.6.29               yulab.utils_0.0.4          
##  [13] htmltools_0.5.2             GOSemSim_2.14.2            
##  [15] viridis_0.6.2               GO.db_3.11.4               
##  [17] fansi_0.5.0                 memoise_2.0.1              
##  [19] tzdb_0.2.0                  graphlayouts_0.7.2         
##  [21] modelr_0.1.8                matrixStats_0.61.0         
##  [23] vroom_1.5.7                 R.utils_2.11.0             
##  [25] askpass_1.1                 enrichplot_1.8.1           
##  [27] prettyunits_1.1.1           colorspace_2.0-2           
##  [29] blob_1.2.2                  rvest_1.0.2                
##  [31] rappdirs_0.3.3              ggrepel_0.9.1              
##  [33] haven_2.4.3                 xfun_0.29                  
##  [35] crayon_1.4.2                RCurl_1.98-1.5             
##  [37] jsonlite_1.7.2              scatterpie_0.1.7           
##  [39] glue_1.5.1                  polyclip_1.10-0            
##  [41] gtable_0.3.0                zlibbioc_1.34.0            
##  [43] XVector_0.28.0              DelayedArray_0.14.1        
##  [45] scales_1.1.1                DOSE_3.14.0                
##  [47] DBI_1.1.1                   Rcpp_1.0.7                 
##  [49] viridisLite_0.4.0           progress_1.2.2             
##  [51] gridGraphics_0.5-1          bit_4.0.4                  
##  [53] europepmc_0.4.1             httr_1.4.2                 
##  [55] fgsea_1.14.0                RColorBrewer_1.1-2         
##  [57] ellipsis_0.3.2              pkgconfig_2.0.3            
##  [59] XML_3.99-0.8                R.methodsS3_1.8.1          
##  [61] farver_2.1.0                sass_0.4.0                 
##  [63] dbplyr_2.1.1                utf8_1.2.2                 
##  [65] labeling_0.4.2              ggplotify_0.1.0            
##  [67] tidyselect_1.1.1            rlang_0.4.12               
##  [69] reshape2_1.4.4              munsell_0.5.0              
##  [71] cellranger_1.1.0            tools_4.0.2                
##  [73] cachem_1.0.6                downloader_0.4             
##  [75] cli_3.1.0                   generics_0.1.1             
##  [77] RSQLite_2.2.9               ggridges_0.5.3             
##  [79] broom_0.7.10                evaluate_0.14              
##  [81] fastmap_1.1.0               yaml_2.2.1                 
##  [83] knitr_1.37                  bit64_4.0.5                
##  [85] fs_1.5.2                    tidygraph_1.2.0            
##  [87] ggraph_2.0.5                R.oo_1.24.0                
##  [89] DO.db_2.9                   xml2_1.3.3                 
##  [91] biomaRt_2.44.4              compiler_4.0.2             
##  [93] rstudioapi_0.13             curl_4.3.2                 
##  [95] reprex_2.0.1                tweenr_1.0.2               
##  [97] bslib_0.3.1                 stringi_1.7.6              
##  [99] highr_0.9                   lattice_0.20-45            
## [101] Matrix_1.4-0                vctrs_0.3.8                
## [103] pillar_1.6.4                lifecycle_1.0.1            
## [105] BiocManager_1.30.16         triebeard_0.3.0            
## [107] jquerylib_0.1.4             data.table_1.14.2          
## [109] cowplot_1.1.1               bitops_1.0-7               
## [111] GenomicRanges_1.40.0        qvalue_2.20.0              
## [113] R6_2.5.1                    gridExtra_2.3              
## [115] MASS_7.3-54                 assertthat_0.2.1           
## [117] SummarizedExperiment_1.18.2 openssl_1.4.5              
## [119] withr_2.4.3                 GenomeInfoDbData_1.2.3     
## [121] hms_1.1.1                   ggfun_0.0.4                
## [123] grid_4.0.2                  rvcheck_0.2.1              
## [125] rmarkdown_2.11              ggforce_0.3.3              
## [127] lubridate_1.8.0